[chore] remove unnecessary updating of _worker_names #19

Merged: 3 commits into volcengine:main on Nov 25, 2024

Conversation

@kevin85421 (Contributor) commented Nov 21, 2024

It looks like self._worker_names doesn't need to be updated here:

    self._worker_names.append(name)

@kevin85421 (Contributor Author) commented Nov 21, 2024

I traced the code some more. It may be required because the constructor is called inside a method of the same class.

Update: the PR now initializes _worker_names in the constructor instead, roughly as sketched below.
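
A simplified sketch of the pattern being discussed (the class shape and the classmethod name here are hypothetical; the real constructor takes more arguments):

    class WorkerGroup:
        def __init__(self, worker_names=None):
            # Initialize _worker_names once in the constructor rather than
            # appending to it while spawning workers.
            self._worker_names = worker_names if worker_names is not None else []

        @classmethod
        def from_detached_workers(cls, worker_names):
            # The constructor is invoked from a method of the same class,
            # so pre-existing worker names are simply passed through.
            return cls(worker_names=worker_names)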

@PeterSH6 (Collaborator)

I will check whether this brings any conflicts tomorrow.

Initializing _worker_names in the constructor may conflict with the L261 you listed here.

@kevin85421 (Contributor Author)

Thanks, @PeterSH6! I will also try to run some examples on my side. Would you mind giving me a pointer to the tests?

@PeterSH6 (Collaborator)

@kevin85421 Thanks! We haven't open-sourced the tests, as we haven't set up CI on GitHub yet. I will release some relevant tests later.

@@ -187,6 +187,9 @@ def __init__(self,
         self.ray_cls_with_init = ray_cls_with_init
         self.name_prefix = get_random_string(length=6) if name_prefix is None else name_prefix
+
+        if worker_names is not None:
+            self._worker_names = worker_names
@PeterSH6 (Collaborator) commented on this change:

I wonder if adding assert self._is_init_with_detached_workers between L190 and L191 would be better?

Your modification will not cause errors in current usage, but it may be better to add this line, roughly as sketched below.
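
A sketch only (it assumes self._is_init_with_detached_workers has already been set earlier in __init__, and the indentation follows the surrounding constructor):

    if worker_names is not None:
        # Pre-existing names should only arrive when initializing from
        # already-detached workers, so guard that invariant here.
        assert self._is_init_with_detached_workers
        self._worker_names = worker_names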

@kevin85421 (Contributor Author)

updated!

@PeterSH6 (Collaborator)

@kevin85421 I have added some tests for Ray single-controller in #23

These test files would be useful for this PR:

Signed-off-by: Kai-Hsun Chen <[email protected]>
@kevin85421 (Contributor Author)

Thank you for the review! I will spend more time on this repo after Thanksgiving. We are currently working on a PoC for veRL with Ray Compiled Graphs.

In addition, we have limited access to A100s. Are there any examples that can run on a smaller GPU machine that I could use for development?

@PeterSH6 (Collaborator)

@kevin85421 Thanks, I'm looking forward to further collaboration!

You can use a smaller model like gemma-2-2b and a smaller micro-batch size. I've attached an example launch script below. While I still need to test its algorithmic accuracy and convergence before merging it, it should work well for debugging on 2 A100 GPUs.

set -x

python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=512 \
    data.val_batch_size=1312 \
    data.max_prompt_length=1024 \
    data.max_response_length=512 \
    actor_rollout_ref.model.path=google/gemma-2-2b-it \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size=4 \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    critic.optim.lr=1e-5 \
    critic.model.path=google/gemma-2-2b-it \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size=4 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.grad_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console','tracking'] \
    trainer.project_name='verl_example' \
    trainer.experiment_name='gemma2b_function_rm' \
    trainer.n_gpus_per_node=2 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=10 \
    trainer.total_epochs=15

PeterSH6 merged commit 5ff6a63 into volcengine:main on Nov 25, 2024